GradQuant: Tidyverse & Data Wrangling

Introduction

What is the “tidyverse”?

The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

About a dozen packages that make up the official tidyverse; however, there are dozens of tidyverse-adjacent packages that follow this philosophy, grammar, and data structures and work well with the official tidyverse packages.

Features

tidy data: a data frame where each row is an observation and each column contains the value of a single variable

pipe operator: %>%, channels the flow of tidy operations

uniform standards for package organization, testing procedures, coding style, etc.

Advantages

  • Ease of use, path of least resistance

  • “Programs must be written for people to read and only incidentally for machines to execute”

  • Parsimony, simplicitiy

Principles of Tidy Data

Benefits of Tidy Data

  1. Tidy data have a consistent data structure - This eliminates the many different ways in which data can be stored. By imposing a uniform data structure, the cognitive load imposed on the analyst is minimized for each new project.
  2. Tidy data foster tool development - Software that all work within the tidy data framework can all work well with one another, even when developed by different individuals, ultimately increasing the variety and scope of tools available, without requiring analysts to learn an entirely new mental model with each new tool.
  3. Tidy data require only a small set of tools to be learned - When using a consistent data format, only a small set of tools is required and these tools can be reused from one project to the next.
  4. Tidy data allow for datasets to be combined - Data are often stored in multiple tables or in different locations. By getting each table into a tidy format, combining across tables or sources becomes trivial.

Rules for Storing Tidy Data

  1. Be consistent
  2. Choose good names for things
  3. Write dates as YYYY-MM-DD
  4. No empty cells
  5. Put just one thing in a cell
  6. Don’t use font color or highlighting as data
  7. Save the data as plain text files

Common problems with messy datasets

  1. Column headers are values but should be variable names.
  2. A single column has multiple variables.
  3. Variables have been entered in both rows and columns.
  4. Multiple “types” of data are in the same spreadsheet.
  5. A single observation is stored across multiple spreadsheets.

Tidyverse’s Tibble: A New Dataframe

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

  • Input type remains unchanged - data.frame is notorious for treating strings as factors; this will not happen with tibbles
  • Variable names remain unchanged - In base R, creating data.frames will remove spaces from names, converting them to periods or add “x” before numeric column names. Creating tibbles will not change variable (column) names.
  • There are no row.names() for a tibble - Tidy data requires that variables be stored in a consistent way, removing the need for row names.
  • Tibbles print first ten rows and columns that fit on one screen - Printing a tibble to screen will never print the entire huge data frame out. By default, it just shows what fits to your screen.

Tidy Basics

Install and load

# install.packages("tidyverse")
library(tidyverse)

Tibble

as_tibble(iris)
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

You can create a tibble just like a dataframe

df <- tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
df
# A tibble: 5 × 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26

Pipes

Pipes are written as %>%and they should be read as “and then”.

df %>% .$x
[1] 1 2 3 4 5

Basic Data Management: MUTATE

Mutate: Create a new variable

mutate() can be used to create variables based on existing variables from the dataset.

airquality %>% 
  mutate(totalDays = 30.5 * Month + Day)
    Ozone Solar.R Wind Temp Month Day totalDays
1      41     190  7.4   67     5   1     153.5
2      36     118  8.0   72     5   2     154.5
3      12     149 12.6   74     5   3     155.5
4      18     313 11.5   62     5   4     156.5
5      NA      NA 14.3   56     5   5     157.5
6      28      NA 14.9   66     5   6     158.5
7      23     299  8.6   65     5   7     159.5
8      19      99 13.8   59     5   8     160.5
9       8      19 20.1   61     5   9     161.5
10     NA     194  8.6   69     5  10     162.5
11      7      NA  6.9   74     5  11     163.5
12     16     256  9.7   69     5  12     164.5
13     11     290  9.2   66     5  13     165.5
14     14     274 10.9   68     5  14     166.5
15     18      65 13.2   58     5  15     167.5
16     14     334 11.5   64     5  16     168.5
17     34     307 12.0   66     5  17     169.5
18      6      78 18.4   57     5  18     170.5
19     30     322 11.5   68     5  19     171.5
20     11      44  9.7   62     5  20     172.5
21      1       8  9.7   59     5  21     173.5
22     11     320 16.6   73     5  22     174.5
23      4      25  9.7   61     5  23     175.5
24     32      92 12.0   61     5  24     176.5
25     NA      66 16.6   57     5  25     177.5
26     NA     266 14.9   58     5  26     178.5
27     NA      NA  8.0   57     5  27     179.5
28     23      13 12.0   67     5  28     180.5
29     45     252 14.9   81     5  29     181.5
30    115     223  5.7   79     5  30     182.5
31     37     279  7.4   76     5  31     183.5
32     NA     286  8.6   78     6   1     184.0
33     NA     287  9.7   74     6   2     185.0
34     NA     242 16.1   67     6   3     186.0
35     NA     186  9.2   84     6   4     187.0
36     NA     220  8.6   85     6   5     188.0
37     NA     264 14.3   79     6   6     189.0
38     29     127  9.7   82     6   7     190.0
39     NA     273  6.9   87     6   8     191.0
40     71     291 13.8   90     6   9     192.0
41     39     323 11.5   87     6  10     193.0
42     NA     259 10.9   93     6  11     194.0
43     NA     250  9.2   92     6  12     195.0
44     23     148  8.0   82     6  13     196.0
45     NA     332 13.8   80     6  14     197.0
46     NA     322 11.5   79     6  15     198.0
47     21     191 14.9   77     6  16     199.0
48     37     284 20.7   72     6  17     200.0
49     20      37  9.2   65     6  18     201.0
50     12     120 11.5   73     6  19     202.0
51     13     137 10.3   76     6  20     203.0
52     NA     150  6.3   77     6  21     204.0
53     NA      59  1.7   76     6  22     205.0
54     NA      91  4.6   76     6  23     206.0
55     NA     250  6.3   76     6  24     207.0
56     NA     135  8.0   75     6  25     208.0
57     NA     127  8.0   78     6  26     209.0
58     NA      47 10.3   73     6  27     210.0
59     NA      98 11.5   80     6  28     211.0
60     NA      31 14.9   77     6  29     212.0
61     NA     138  8.0   83     6  30     213.0
62    135     269  4.1   84     7   1     214.5
63     49     248  9.2   85     7   2     215.5
64     32     236  9.2   81     7   3     216.5
65     NA     101 10.9   84     7   4     217.5
66     64     175  4.6   83     7   5     218.5
67     40     314 10.9   83     7   6     219.5
68     77     276  5.1   88     7   7     220.5
69     97     267  6.3   92     7   8     221.5
70     97     272  5.7   92     7   9     222.5
71     85     175  7.4   89     7  10     223.5
72     NA     139  8.6   82     7  11     224.5
73     10     264 14.3   73     7  12     225.5
74     27     175 14.9   81     7  13     226.5
75     NA     291 14.9   91     7  14     227.5
76      7      48 14.3   80     7  15     228.5
77     48     260  6.9   81     7  16     229.5
78     35     274 10.3   82     7  17     230.5
79     61     285  6.3   84     7  18     231.5
80     79     187  5.1   87     7  19     232.5
81     63     220 11.5   85     7  20     233.5
82     16       7  6.9   74     7  21     234.5
83     NA     258  9.7   81     7  22     235.5
84     NA     295 11.5   82     7  23     236.5
85     80     294  8.6   86     7  24     237.5
86    108     223  8.0   85     7  25     238.5
87     20      81  8.6   82     7  26     239.5
88     52      82 12.0   86     7  27     240.5
89     82     213  7.4   88     7  28     241.5
90     50     275  7.4   86     7  29     242.5
91     64     253  7.4   83     7  30     243.5
92     59     254  9.2   81     7  31     244.5
93     39      83  6.9   81     8   1     245.0
94      9      24 13.8   81     8   2     246.0
95     16      77  7.4   82     8   3     247.0
96     78      NA  6.9   86     8   4     248.0
97     35      NA  7.4   85     8   5     249.0
98     66      NA  4.6   87     8   6     250.0
99    122     255  4.0   89     8   7     251.0
100    89     229 10.3   90     8   8     252.0
101   110     207  8.0   90     8   9     253.0
102    NA     222  8.6   92     8  10     254.0
103    NA     137 11.5   86     8  11     255.0
104    44     192 11.5   86     8  12     256.0
105    28     273 11.5   82     8  13     257.0
106    65     157  9.7   80     8  14     258.0
107    NA      64 11.5   79     8  15     259.0
108    22      71 10.3   77     8  16     260.0
109    59      51  6.3   79     8  17     261.0
110    23     115  7.4   76     8  18     262.0
111    31     244 10.9   78     8  19     263.0
112    44     190 10.3   78     8  20     264.0
113    21     259 15.5   77     8  21     265.0
114     9      36 14.3   72     8  22     266.0
115    NA     255 12.6   75     8  23     267.0
116    45     212  9.7   79     8  24     268.0
117   168     238  3.4   81     8  25     269.0
118    73     215  8.0   86     8  26     270.0
119    NA     153  5.7   88     8  27     271.0
120    76     203  9.7   97     8  28     272.0
121   118     225  2.3   94     8  29     273.0
122    84     237  6.3   96     8  30     274.0
123    85     188  6.3   94     8  31     275.0
124    96     167  6.9   91     9   1     275.5
125    78     197  5.1   92     9   2     276.5
126    73     183  2.8   93     9   3     277.5
127    91     189  4.6   93     9   4     278.5
128    47      95  7.4   87     9   5     279.5
129    32      92 15.5   84     9   6     280.5
130    20     252 10.9   80     9   7     281.5
131    23     220 10.3   78     9   8     282.5
132    21     230 10.9   75     9   9     283.5
133    24     259  9.7   73     9  10     284.5
134    44     236 14.9   81     9  11     285.5
135    21     259 15.5   76     9  12     286.5
136    28     238  6.3   77     9  13     287.5
137     9      24 10.9   71     9  14     288.5
138    13     112 11.5   71     9  15     289.5
139    46     237  6.9   78     9  16     290.5
140    18     224 13.8   67     9  17     291.5
141    13      27 10.3   76     9  18     292.5
142    24     238 10.3   68     9  19     293.5
143    16     201  8.0   82     9  20     294.5
144    13     238 12.6   64     9  21     295.5
145    23      14  9.2   71     9  22     296.5
146    36     139 10.3   81     9  23     297.5
147     7      49 10.3   69     9  24     298.5
148    14      20 16.6   63     9  25     299.5
149    30     193  6.9   70     9  26     300.5
150    NA     145 13.2   77     9  27     301.5
151    14     191 14.3   75     9  28     302.5
152    18     131  8.0   76     9  29     303.5
153    20     223 11.5   68     9  30     304.5

Mutate: Create a new variable

Another example

diamonds %>% 
  mutate(price200 = price - 200)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows

Mutate: Create multiple new variables

diamonds %>% 
  mutate(price200 = price - 200,        # $200 OFF from the original price
         price20perc = price * 0.20,    # 20% of the original price
         price20percoff = price * 0.80, # 20% OFF from the original price 
         pricepercarat = price / carat, # ratio of price to carat
         pizza = depth ^ 2)             # Square the original depth
# A tibble: 53,940 × 15
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
#   pricepercarat <dbl>, pizza <dbl>

Mutate: Nesting a Function

diamonds %>% 
  mutate(m = mean(price))
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z     m
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933.
# ℹ 53,930 more rows

Mutate: Nesting a Function

diamonds %>% 
  mutate(m = mean(price),     # calculates the mean price
         sd = sd(price),      # calculates standard deviation
         med = median(price)) # calculates the median price
# A tibble: 53,940 × 13
   carat cut       color clarity depth table price     x     y     z     m    sd
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933. 3989.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933. 3989.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933. 3989.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933. 3989.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933. 3989.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933. 3989.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933. 3989.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933. 3989.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933. 3989.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>

Mutate: Recode

modifies the values within a variable. Below, “Ideal” is now “IDEAL”

diamonds %>% 
  mutate(cut.new = recode(cut,
                          "Ideal" = "IDEAL"))
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z cut.new  
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>    
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 IDEAL    
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 Premium  
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 Good     
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 Premium  
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 Good     
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 Very Good
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Very Good
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 Very Good
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 Fair     
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 Very Good
# ℹ 53,930 more rows

Mutate: Recoding multiple values at once

diamonds %>% 
  mutate(cut.new = recode(cut,
                          "Ideal" = "IDEAL",
                          "Fair" = "Okay",
                          "Premium" = "pizza"))
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z cut.new  
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>    
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 IDEAL    
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 pizza    
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 Good     
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 pizza    
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 Good     
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 Very Good
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 Very Good
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 Very Good
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 Okay     
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 Very Good
# ℹ 53,930 more rows

Basic Data Management: SUMMARIZE

collapses all rows and returns a one-row summary. R will recognize both the British and American spelling (summarise/summarize).

diamonds %>% 
  summarize(avg.price = mean(price))
# A tibble: 1 × 1
  avg.price
      <dbl>
1     3933.

Similar to mutate(), we can also perform multiple operations with summarize() and nest other useful functions inside it:

diamonds %>% 
  summarize(avg.price = mean(price),     # average price of all diamonds
            dbl.price = mean(price) * 2, # calculating double the average price
            random.add = 1 + 2,          # a math operation without an existing variable 
            avg.carat = mean(carat),     # average carat size of all diamonds
            stdev.price = sd(price))     # calculating the standard deviation 
# A tibble: 1 × 5
  avg.price dbl.price random.add avg.carat stdev.price
      <dbl>     <dbl>      <dbl>     <dbl>       <dbl>
1     3933.     7866.          3     0.798       3989.

Basic Data Management: GROUP BY

Group By and Mutate

mtcars %>% 
  group_by(gear) %>%
  mutate(aveMPG = mean(mpg, na.rm = T)) %>%
  ungroup()
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb aveMPG
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4   24.5
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4   24.5
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1   24.5
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1   16.1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2   16.1
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1   16.1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4   16.1
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2   24.5
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2   24.5
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4   24.5
# ℹ 22 more rows

Group By and Summarise

mtcars %>% 
  group_by(gear) %>%
  summarise(aveMPG = mean(mpg, na.rm = T)) %>%
  ungroup()
# A tibble: 3 × 2
   gear aveMPG
  <dbl>  <dbl>
1     3   16.1
2     4   24.5
3     5   21.4

Always Ungroup after Grouping!

If you forget to ungroup(), future calculations on that object will be grouped!

Create new tibble

## Creating identification number to represent 50 individual people
ID <- c(1:50)

## Creating sex variable (25 males/25 females)
Sex <- rep(c("male", "female"), 25) # rep stands for replicate

## Creating age variable (20-39 year olds)
Age <- c(26, 25, 39, 37, 31, 34, 34, 30, 26, 33, 
         39, 28, 26, 29, 33, 22, 35, 23, 26, 36, 
         21, 20, 31, 21, 35, 39, 36, 22, 22, 25, 
         27, 30, 26, 34, 38, 39, 30, 29, 26, 25, 
         26, 36, 23, 21, 21, 39, 26, 26, 27, 21) 

## Creating a dependent variable called Score
Score <- c(0.010, 0.418, 0.014, 0.090, 0.061, 0.328, 0.656, 0.002, 0.639, 0.173, 
           0.076, 0.152, 0.467, 0.186, 0.520, 0.493, 0.388, 0.501, 0.800, 0.482, 
           0.384, 0.046, 0.920, 0.865, 0.625, 0.035, 0.501, 0.851, 0.285, 0.752, 
           0.686, 0.339, 0.710, 0.665, 0.214, 0.560, 0.287, 0.665, 0.630, 0.567, 
           0.812, 0.637, 0.772, 0.905, 0.405, 0.363, 0.773, 0.410, 0.535, 0.449)

## Creating a unified dataset that puts together all variables
data <- tibble(ID, Sex, Age, Score)

Group By and Summarize

Another example

data %>% 
  group_by(Sex) %>% 
  summarize(m = mean(Score), # calculates the mean
            s = sd(Score),   # calculates the standard deviation
            n = n())     # calculates the total number of observations
# A tibble: 2 × 4
  Sex        m     s     n
  <chr>  <dbl> <dbl> <int>
1 female 0.437 0.268    25
2 male   0.487 0.268    25

Basic Data Management: FILTER

Filter: if cut is ‘Fair’

Only retain specific rows of data that meet the specified requirement(s).

diamonds %>% filter(cut == "Fair")
# A tibble: 1,610 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 1,600 more rows

Filter: if cut is Fair or Good, and Price is at or less than $600

diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600)
# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

Filter: if cut is Fair or Good, and Price is at or less than $600

diamonds %>%
  filter(cut %in% c("Fair", "Good"), price <= 600)
# A tibble: 505 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
 3  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 4  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
 5  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
 6  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
 7  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
 8  0.23 Good  F     VS1      58.2    59   402  4.06  4.08  2.37
 9  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
10  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
# ℹ 495 more rows

This is effectively identical to subset in base R, except for some minute differences.

Basic Data Management: SELECT

Select: Retain only cut and color columns

diamonds %>% select(cut, color)
# A tibble: 53,940 × 2
   cut       color
   <ord>     <ord>
 1 Ideal     E    
 2 Premium   E    
 3 Good      E    
 4 Premium   I    
 5 Good      J    
 6 Very Good J    
 7 Very Good I    
 8 Very Good H    
 9 Fair      E    
10 Very Good H    
# ℹ 53,930 more rows

Select: Retain the first five columns

diamonds %>% select(1:5)
# A tibble: 53,940 × 5
   carat cut       color clarity depth
   <dbl> <ord>     <ord> <ord>   <dbl>
 1  0.23 Ideal     E     SI2      61.5
 2  0.21 Premium   E     SI1      59.8
 3  0.23 Good      E     VS1      56.9
 4  0.29 Premium   I     VS2      62.4
 5  0.31 Good      J     SI2      63.3
 6  0.24 Very Good J     VVS2     62.8
 7  0.24 Very Good I     VVS1     62.3
 8  0.26 Very Good H     SI1      61.9
 9  0.22 Fair      E     VS2      65.1
10  0.23 Very Good H     VS1      59.4
# ℹ 53,930 more rows
# or

diamonds %>% select(1,2,3,4,5)
# A tibble: 53,940 × 5
   carat cut       color clarity depth
   <dbl> <ord>     <ord> <ord>   <dbl>
 1  0.23 Ideal     E     SI2      61.5
 2  0.21 Premium   E     SI1      59.8
 3  0.23 Good      E     VS1      56.9
 4  0.29 Premium   I     VS2      62.4
 5  0.31 Good      J     SI2      63.3
 6  0.24 Very Good J     VVS2     62.8
 7  0.24 Very Good I     VVS1     62.3
 8  0.26 Very Good H     SI1      61.9
 9  0.22 Fair      E     VS2      65.1
10  0.23 Very Good H     VS1      59.4
# ℹ 53,930 more rows

Select: Retain all columns except cut

diamonds %>% select(-cut)
# A tibble: 53,940 × 9
   carat color clarity depth table price     x     y     z
   <dbl> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Select: Retain all columns except cut and color

diamonds %>% select(-cut, -color)
# A tibble: 53,940 × 8
   carat clarity depth table price     x     y     z
   <dbl> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Select: Retain all columns except the first five columns

diamonds %>% select (-1,-2,-3,-4,-5)
# A tibble: 53,940 × 5
   table price     x     y     z
   <dbl> <int> <dbl> <dbl> <dbl>
 1    55   326  3.95  3.98  2.43
 2    61   326  3.89  3.84  2.31
 3    65   327  4.05  4.07  2.31
 4    58   334  4.2   4.23  2.63
 5    58   335  4.34  4.35  2.75
 6    57   336  3.94  3.96  2.48
 7    57   336  3.95  3.98  2.47
 8    55   337  4.07  4.11  2.53
 9    61   337  3.87  3.78  2.49
10    61   338  4     4.05  2.39
# ℹ 53,930 more rows
# or
diamonds %>% select(-(1:5))
# A tibble: 53,940 × 5
   table price     x     y     z
   <dbl> <int> <dbl> <dbl> <dbl>
 1    55   326  3.95  3.98  2.43
 2    61   326  3.89  3.84  2.31
 3    65   327  4.05  4.07  2.31
 4    58   334  4.2   4.23  2.63
 5    58   335  4.34  4.35  2.75
 6    57   336  3.94  3.96  2.48
 7    57   336  3.95  3.98  2.47
 8    55   337  4.07  4.11  2.53
 9    61   337  3.87  3.78  2.49
10    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Select: Rearrange the columns so x, y, z and arranged first

diamonds %>% select(x,y,z, everything())
# A tibble: 53,940 × 10
       x     y     z carat cut       color clarity depth table price
   <dbl> <dbl> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  3.95  3.98  2.43  0.23 Ideal     E     SI2      61.5    55   326
 2  3.89  3.84  2.31  0.21 Premium   E     SI1      59.8    61   326
 3  4.05  4.07  2.31  0.23 Good      E     VS1      56.9    65   327
 4  4.2   4.23  2.63  0.29 Premium   I     VS2      62.4    58   334
 5  4.34  4.35  2.75  0.31 Good      J     SI2      63.3    58   335
 6  3.94  3.96  2.48  0.24 Very Good J     VVS2     62.8    57   336
 7  3.95  3.98  2.47  0.24 Very Good I     VVS1     62.3    57   336
 8  4.07  4.11  2.53  0.26 Very Good H     SI1      61.9    55   337
 9  3.87  3.78  2.49  0.22 Fair      E     VS2      65.1    61   337
10  4     4.05  2.39  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

Basic Data Management: ARRANGE

Arrange: Cut in alphabetical order

diamonds %>% arrange(cut)
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 53,930 more rows

Arrange: Price from lowest to highest

diamonds %>% arrange(price)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Arrange: Cut in descending alphabetical order

diamonds %>% arrange(desc(cut))
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
 9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
# ℹ 53,930 more rows

Arrange: Price from highest to lowest

diamonds %>% arrange(desc(price))
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

Exercise!

Can you explain each step of this function?

diamonds %>%                         
  group_by(color, clarity) %>%       
  mutate(price200 = mean(price)) %>% 
  ungroup() %>%                      
  mutate(random10 = 10 + price) %>%  
  select(cut, color,                 
         clarity, price, 
         price200, random10) %>% 
  arrange(color) %>%                 
  group_by(cut) %>%                  
  mutate(dis = n_distinct(price),     
         rowID = row_number()) %>%   
  ungroup()                          
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows
diamonds %>%                         # utilizes the diamonds dataset
  group_by(color, clarity) %>%       # groups data by color and clarity variables
  mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
  ungroup() %>%                      # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%  # new variable, original price + $10
  select(cut, color,                 # retain only these listed columns
         clarity, price, 
         price200, random10) %>% 
  arrange(color) %>%                 # visualize data ordered by color
  group_by(cut) %>%                  # group data by cut
  mutate(dis = n_distinct(price),    # counts the number of unique price values per cut 
         rowID = row_number()) %>%   # numbers each row consecutively for each cut
  ungroup()                          # final ungrouping of data
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows
midwest %>% 
  group_by(state) %>% 
  summarize(poptotalmean = mean(poptotal),
            poptotalmed = median(poptotal),
            popmax = max(poptotal),
            popmin = min(poptotal),
            popdistinct = n_distinct(poptotal),
            popfirst = first(poptotal),
            popany = any(poptotal < 5000),
            popany2 = any(poptotal > 2000000)) %>% 
  ungroup()
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>
midwest %>% 
  group_by(state) %>% 
  summarize(num5k = sum(poptotal < 5000),
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>% 
  ungroup()
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
midwest %>% 
  group_by(county) %>% 
  summarize(x = n_distinct(state)) %>% 
  arrange(desc(x)) %>% 
  ungroup()
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows
midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% 
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Advanced Data Management: COUNT

Count: Number of observations

Collapses the rows and counts the number of observations per group of values.

diamonds %>% count(cut) 
# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551
# is the same as

diamonds %>% group_by(cut) %>% count()
# A tibble: 5 × 2
# Groups:   cut [5]
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551
# is the same as

diamonds %>% 
  group_by(cut) %>% 
  summarize(n = n())
# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

Count: For two columns

diamonds %>% count(cut, clarity)
# A tibble: 40 × 3
   cut   clarity     n
   <ord> <ord>   <int>
 1 Fair  I1        210
 2 Fair  SI2       466
 3 Fair  SI1       408
 4 Fair  VS2       261
 5 Fair  VS1       170
 6 Fair  VVS2       69
 7 Fair  VVS1       17
 8 Fair  IF          9
 9 Good  I1         96
10 Good  SI2      1081
# ℹ 30 more rows
# is the same as
diamonds %>% group_by(cut, clarity) %>% count()
# A tibble: 40 × 3
# Groups:   cut, clarity [40]
   cut   clarity     n
   <ord> <ord>   <int>
 1 Fair  I1        210
 2 Fair  SI2       466
 3 Fair  SI1       408
 4 Fair  VS2       261
 5 Fair  VS1       170
 6 Fair  VVS2       69
 7 Fair  VVS1       17
 8 Fair  IF          9
 9 Good  I1         96
10 Good  SI2      1081
# ℹ 30 more rows
diamonds %>% group_by(cut, clarity) %>% summarize(n = n())
# A tibble: 40 × 3
# Groups:   cut [5]
   cut   clarity     n
   <ord> <ord>   <int>
 1 Fair  I1        210
 2 Fair  SI2       466
 3 Fair  SI1       408
 4 Fair  VS2       261
 5 Fair  VS1       170
 6 Fair  VVS2       69
 7 Fair  VVS1       17
 8 Fair  IF          9
 9 Good  I1         96
10 Good  SI2      1081
# ℹ 30 more rows

Count: Filter greater than 1

#install.packages("nycflights13")
library(nycflights13)

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: tailnum <chr>, n <int>

Count: Filter greater than 1 counts across multiple columns

weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
# A tibble: 3 × 6
   year month   day  hour origin     n
  <int> <int> <int> <int> <chr>  <int>
1  2013    11     3     1 EWR        2
2  2013    11     3     1 JFK        2
3  2013    11     3     1 LGA        2

Advanced Data Management: SEPARATE/UNITE

Import

## download file 
conservation <- read_csv("https://raw.githubusercontent.com/suzanbaert/Dplyr_Tutorials/master/conservation_explanation.csv")

## take a look at this file
conservation
# A tibble: 11 × 1
   `conservation abbreviation`                  
   <chr>                                        
 1 EX = Extinct                                 
 2 EW = Extinct in the wild                     
 3 CR = Critically Endangered                   
 4 EN = Endangered                              
 5 VU = Vulnerable                              
 6 NT = Near Threatened                         
 7 LC = Least Concern                           
 8 DD = Data deficient                          
 9 NE = Not evaluated                           
10 PE = Probably extinct (informal)             
11 PEW = Probably extinct in the wild (informal)

Separate by “=”

conservation %>%
  separate(`conservation abbreviation`, 
           into = c("abbreviation", "description"), sep = " = ")
# A tibble: 11 × 2
   abbreviation description                            
   <chr>        <chr>                                  
 1 EX           Extinct                                
 2 EW           Extinct in the wild                    
 3 CR           Critically Endangered                  
 4 EN           Endangered                             
 5 VU           Vulnerable                             
 6 NT           Near Threatened                        
 7 LC           Least Concern                          
 8 DD           Data deficient                         
 9 NE           Not evaluated                          
10 PE           Probably extinct (informal)            
11 PEW          Probably extinct in the wild (informal)

Unite by “=”

conservation %>%
  separate(`conservation abbreviation`, 
           into = c("abbreviation", "description"), sep = " = ") %>%
  unite(united_col, abbreviation, description, sep = " = ")
# A tibble: 11 × 1
   united_col                                   
   <chr>                                        
 1 EX = Extinct                                 
 2 EW = Extinct in the wild                     
 3 CR = Critically Endangered                   
 4 EN = Endangered                              
 5 VU = Vulnerable                              
 6 NT = Near Threatened                         
 7 LC = Least Concern                           
 8 DD = Data deficient                          
 9 NE = Not evaluated                           
10 PE = Probably extinct (informal)             
11 PEW = Probably extinct in the wild (informal)

Advanced Data Management: CLEAN NAMES

This function takes the existing column names of your dataset, converts them all to lowercase letters and numbers, and separates all words using the underscore character. For example, there is a space in the column name for conservation.

#install.packages("janitor")
library(janitor)
conservation %>%
  clean_names()
# A tibble: 11 × 1
   conservation_abbreviation                    
   <chr>                                        
 1 EX = Extinct                                 
 2 EW = Extinct in the wild                     
 3 CR = Critically Endangered                   
 4 EN = Endangered                              
 5 VU = Vulnerable                              
 6 NT = Near Threatened                         
 7 LC = Least Concern                           
 8 DD = Data deficient                          
 9 NE = Not evaluated                           
10 PE = Probably extinct (informal)             
11 PEW = Probably extinct in the wild (informal)

Advanced Data Management: RENAME

Rename a single variable

diamonds %>% rename(PRICE = price)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table PRICE     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
# is the same as

diamonds %>% 
  mutate(PRICE = price) %>% # creates new variable based on old variable
  select(-price) # removes old variable from dataset
# A tibble: 53,940 × 10
   carat cut       color clarity depth table     x     y     z PRICE
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55  3.95  3.98  2.43   326
 2  0.21 Premium   E     SI1      59.8    61  3.89  3.84  2.31   326
 3  0.23 Good      E     VS1      56.9    65  4.05  4.07  2.31   327
 4  0.29 Premium   I     VS2      62.4    58  4.2   4.23  2.63   334
 5  0.31 Good      J     SI2      63.3    58  4.34  4.35  2.75   335
 6  0.24 Very Good J     VVS2     62.8    57  3.94  3.96  2.48   336
 7  0.24 Very Good I     VVS1     62.3    57  3.95  3.98  2.47   336
 8  0.26 Very Good H     SI1      61.9    55  4.07  4.11  2.53   337
 9  0.22 Fair      E     VS2      65.1    61  3.87  3.78  2.49   337
10  0.23 Very Good H     VS1      59.4    61  4     4.05  2.39   338
# ℹ 53,930 more rows

Rename multiple variables

diamonds %>% rename(Length = x, Width = y, Depth = z)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price Length Width Depth
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>  <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326   3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326   3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327   4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334   4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335   4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336   3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336   3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337   4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337   3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338   4     4.05  2.39
# ℹ 53,930 more rows
# is the same as

diamonds %>% 
  mutate(Length = x, Width = y, Depth = z) %>% 
  select(-x, -y, -z)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price Length Width Depth
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>  <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326   3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326   3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327   4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334   4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335   4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336   3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336   3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337   4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337   3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338   4     4.05  2.39
# ℹ 53,930 more rows

Advanced Data Management: CASE WHEN

If divisble by a number, assign a name

A vectorized form of if else

x <- 1:70
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  .default = as.character(x)
)
 [1] "1"         "2"         "3"         "4"         "fizz"      "6"        
 [7] "buzz"      "8"         "9"         "fizz"      "11"        "12"       
[13] "13"        "buzz"      "fizz"      "16"        "17"        "18"       
[19] "19"        "fizz"      "buzz"      "22"        "23"        "24"       
[25] "fizz"      "26"        "27"        "buzz"      "29"        "fizz"     
[31] "31"        "32"        "33"        "34"        "fizz buzz" "36"       
[37] "37"        "38"        "39"        "fizz"      "41"        "buzz"     
[43] "43"        "44"        "fizz"      "46"        "47"        "48"       
[49] "buzz"      "fizz"      "51"        "52"        "53"        "54"       
[55] "fizz"      "buzz"      "57"        "58"        "59"        "fizz"     
[61] "61"        "62"        "buzz"      "64"        "fizz"      "66"       
[67] "67"        "68"        "69"        "fizz buzz"

Reassign name if a certain height or species

starwars %>%
  select(name:mass, gender, species) %>%
  mutate(
    type = case_when(
      height > 200 | mass > 200 ~ "large",
      species == "Droid" ~ "robot",
      .default = "other"
    )
  )
# A tibble: 87 × 6
   name               height  mass gender    species type 
   <chr>               <int> <dbl> <chr>     <chr>   <chr>
 1 Luke Skywalker        172    77 masculine Human   other
 2 C-3PO                 167    75 masculine Droid   robot
 3 R2-D2                  96    32 masculine Droid   robot
 4 Darth Vader           202   136 masculine Human   large
 5 Leia Organa           150    49 feminine  Human   other
 6 Owen Lars             178   120 masculine Human   other
 7 Beru Whitesun lars    165    75 feminine  Human   other
 8 R5-D4                  97    32 masculine Droid   robot
 9 Biggs Darklighter     183    84 masculine Human   other
10 Obi-Wan Kenobi        182    77 masculine Human   other
# ℹ 77 more rows

Create data

#create data frame
df <- data.frame(player = c('AJ', 'Bob', 'Chad', 'Dan', 'Eric', 'Frank'),
                 position = c('G', 'F', 'F', 'G', 'C', NA),
                 points = c(12, 15, 19, 22, 32, NA),
                 assists = c(5, 7, 7, 12, 11, NA))

#view data frame
df
  player position points assists
1     AJ        G     12       5
2    Bob        F     15       7
3   Chad        F     19       7
4    Dan        G     22      12
5   Eric        C     32      11
6  Frank     <NA>     NA      NA

Create New Variable from One Existing Variable

df %>%
  mutate(quality = case_when(points > 20 ~ 'high',
                             points > 15 ~ 'med',
                             TRUE ~ 'low' ))
  player position points assists quality
1     AJ        G     12       5     low
2    Bob        F     15       7     low
3   Chad        F     19       7     med
4    Dan        G     22      12    high
5   Eric        C     32      11    high
6  Frank     <NA>     NA      NA     low

Create New Variable from Multiple Variables

df %>%
  mutate(quality = case_when(points > 15 & assists > 10 ~ 'great',
                             points > 15 & assists > 5 ~ 'good',
                             TRUE ~ 'average' ))
  player position points assists quality
1     AJ        G     12       5 average
2    Bob        F     15       7 average
3   Chad        F     19       7    good
4    Dan        G     22      12   great
5   Eric        C     32      11   great
6  Frank     <NA>     NA      NA average

Advanced Data Management: JOIN

Separate information

## take conservation dataset and separate information
## into two columns
## call that new object `conserve`
conserve <- conservation %>%
  separate(`conservation abbreviation`, 
           into = c("abbreviation", "description"), sep = " = ")
conserve
# A tibble: 11 × 2
   abbreviation description                            
   <chr>        <chr>                                  
 1 EX           Extinct                                
 2 EW           Extinct in the wild                    
 3 CR           Critically Endangered                  
 4 EN           Endangered                             
 5 VU           Vulnerable                             
 6 NT           Near Threatened                        
 7 LC           Least Concern                          
 8 DD           Data deficient                         
 9 NE           Not evaluated                          
10 PE           Probably extinct (informal)            
11 PEW          Probably extinct in the wild (informal)

Left join

## now lets join the two datasets together
msleep %>%
  mutate(conservation = toupper(conservation)) %>%
  left_join(conserve, by = c("conservation" = "abbreviation"))
# A tibble: 83 × 12
   name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
   <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
 1 Cheet… Acin… carni Carn… LC                  12.1      NA        NA      11.9
 2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
 3 Mount… Aplo… herbi Rode… NT                  14.4       2.4      NA       9.6
 4 Great… Blar… omni  Sori… LC                  14.9       2.3       0.133   9.1
 5 Cow    Bos   herbi Arti… DOMESTICATED         4         0.7       0.667  20  
 6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
 7 North… Call… carni Carn… VU                   8.7       1.4       0.383  15.3
 8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
 9 Dog    Canis carni Carn… DOMESTICATED        10.1       2.9       0.333  13.9
10 Roe d… Capr… herbi Arti… LC                   3        NA        NA      21  
# ℹ 73 more rows
# ℹ 3 more variables: brainwt <dbl>, bodywt <dbl>, description <chr>

Advanced Data Management: JOIN

Generate data

data1 <- data.frame(ID = 1:2,                      # Create first example data frame
                    X1 = c("a1", "a2"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = 2:3,                      # Create second example data frame
                    X2 = c("b1", "b2"),
                    stringsAsFactors = FALSE)

Examples

Join R Functions of dplyr Package Overview

Inner join

Return all rows where there are matches from both

inner_join(data1, data2, by = "ID")                # Apply inner_join dplyr function
  ID X1 X2
1  2 a2 b1

Inner Join Depiction

R inner_join dplyr Package Function

R inner_join dplyr Package Function

Left Join

left_join retains all rows of the data table, which is inserted first into the function (i.e. the X-data)

left_join(data1, data2, by = "ID")                 # Apply left_join dplyr function
  ID X1   X2
1  1 a1 <NA>
2  2 a2   b1

Left Join Depiction

R left_join dplyr Package Function

R left_join dplyr Package Function

Right Join

right_join function retains all rows of the data on the right side (i.e. the Y-data). If you compare left join vs. right join, you can see that both functions are keeping the rows of the opposite data.

right_join(data1, data2, by = "ID")                # Apply right_join dplyr function
  ID   X1 X2
1  2   a2 b1
2  3 <NA> b2

Right Join Depiction

R right_join dplyr Package Function

R right_join dplyr Package Function

Full Join

full_join functions retains all rows of both input data sets and inserts NA when an ID is missing in one of the data frames.

full_join(data1, data2, by = "ID")                 # Apply full_join dplyr function
  ID   X1   X2
1  1   a1 <NA>
2  2   a2   b1
3  3 <NA>   b2

Full Join Depiction

R full_join dplyr Package Function

R full_join dplyr Package Function

Filter Joins: semi_join and anti_join

The next two join functions (i.e. semi_join and anti_join) are so called filtering joins. Filtering joins keep cases from the left data table (i.e. the X-data) and use the right data (i.e. the Y-data) as filter.

Advanced Data Management: PIVOT

Wide vs. Long

Wide: Data are often entered and stored in a “wide” format - where a subject’s characteristics or responses are stored in a single row. While this may be useful for presentation, it is not ideal for some types of analysis.

Depiction

Long Depiction

Inspect data

mtcars <- tibble::rownames_to_column(mtcars, "Model")
mtcars
                 Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Pivot columns to longer

mtcars_long <- pivot_longer(mtcars,
             cols = mpg:carb,
             names_to = "Category",
             values_to = "Value"
)
mtcars_long
# A tibble: 352 × 3
   Model     Category  Value
   <chr>     <chr>     <dbl>
 1 Mazda RX4 mpg       21   
 2 Mazda RX4 cyl        6   
 3 Mazda RX4 disp     160   
 4 Mazda RX4 hp       110   
 5 Mazda RX4 drat       3.9 
 6 Mazda RX4 wt         2.62
 7 Mazda RX4 qsec      16.5 
 8 Mazda RX4 vs         0   
 9 Mazda RX4 am         1   
10 Mazda RX4 gear       4   
# ℹ 342 more rows

Inspect data

us_rent_income
# A tibble: 104 × 5
   GEOID NAME       variable estimate   moe
   <chr> <chr>      <chr>       <dbl> <dbl>
 1 01    Alabama    income      24476   136
 2 01    Alabama    rent          747     3
 3 02    Alaska     income      32940   508
 4 02    Alaska     rent         1200    13
 5 04    Arizona    income      27517   148
 6 04    Arizona    rent          972     4
 7 05    Arkansas   income      23789   165
 8 05    Arkansas   rent          709     5
 9 06    California income      29454   109
10 06    California rent         1358     3
# ℹ 94 more rows

Pivot data

rent_income_wide <- pivot_wider(us_rent_income,
             # id_cols = optional vector of unaffected columns,
             names_from = c(variable),
             values_from = c(estimate, moe),
             names_sep = "_"         
)
rent_income_wide
# A tibble: 52 × 6
   GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
   <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
 1 01    Alabama                        24476           747        136        3
 2 02    Alaska                         32940          1200        508       13
 3 04    Arizona                        27517           972        148        4
 4 05    Arkansas                       23789           709        165        5
 5 06    California                     29454          1358        109        3
 6 08    Colorado                       32401          1125        109        5
 7 09    Connecticut                    35326          1123        195        5
 8 10    Delaware                       31560          1076        247       10
 9 11    District of Columbia           43198          1424        681       17
10 12    Florida                        25952          1077         70        3
# ℹ 42 more rows

Import another Example

tbi_age <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-24/tbi_age.csv')

Summarize Long Data

tbi_age %>%
  dplyr::group_by(age_group) %>%
  dplyr::summarise(sum_num = sum(number_est, na.rm = TRUE))
# A tibble: 11 × 2
   age_group sum_num
   <chr>       <dbl>
 1 0-17       836718
 2 0-4        331403
 3 15-24      476652
 4 25-34      319303
 5 35-44      240462
 6 45-54      265417
 7 5-14       348913
 8 55-64      241511
 9 65-74      211035
10 75+        442377
11 Total     2877539

Pivot Wider

tbi_age %>%
  dplyr::select(type, injury_mechanism, age_group, number_est) %>%
  tidyr::pivot_wider(names_from = age_group, values_from = number_est)
# A tibble: 21 × 13
   type    injury_mechanism `0-17`  `0-4` `5-14` `15-24` `25-34` `35-44` `45-54`
   <chr>   <chr>             <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Emerge… Motor Vehicle C…  47138   5464  19785  103892   71641   44108   40020
 2 Emerge… Unintentional F… 397190 230776 133084   96568   70210   68830   95127
 3 Emerge… Unintentionally… 229236  53436 120839  106679   44404   32479   30495
 4 Emerge… Other unintenti…  55785  12007  30656   37118   22360   17541   17808
 5 Emerge… Intentional sel…     NA     NA     NA     870     650     421     247
 6 Emerge… Assault           24360    674   9690   65399   57213   34100   27682
 7 Emerge… Other or no mec…  57983  19360  26022   33395   20974   16503   15962
 8 Hospit… Motor Vehicle C…   5830    870   2395   12925   11050    7305    8490
 9 Hospit… Unintentional F…   7935   4700   2270    3910    4470    5640   12010
10 Hospit… Unintentionally…   1985    510    980    1070     635     610     685
# ℹ 11 more rows
# ℹ 4 more variables: `55-64` <dbl>, `65-74` <dbl>, `75+` <dbl>, Total <dbl>

Generate data

data <- data.frame(ID1 = LETTERS[1:4],        # Create example data
                   ID2 = rep(letters[1:3], each = 4),
                   x = 1:12,
                   y = 21:32)
data                                          # Print example data
   ID1 ID2  x  y
1    A   a  1 21
2    B   a  2 22
3    C   a  3 23
4    D   a  4 24
5    A   b  5 25
6    B   b  6 26
7    C   b  7 27
8    D   b  8 28
9    A   c  9 29
10   B   c 10 30
11   C   c 11 31
12   D   c 12 32

Pivot longer

data_long <- pivot_longer(data = data,        # Convert wide to long
                          cols = c("x", "y"))
data_long                                     # Print long data
# A tibble: 24 × 4
   ID1   ID2   name  value
   <chr> <chr> <chr> <int>
 1 A     a     x         1
 2 A     a     y        21
 3 B     a     x         2
 4 B     a     y        22
 5 C     a     x         3
 6 C     a     y        23
 7 D     a     x         4
 8 D     a     y        24
 9 A     b     x         5
10 A     b     y        25
# ℹ 14 more rows

Pivot wider

data_wide <- pivot_wider(data = data_long)    # Convert long to wide
data_wide                                     # Print wide data
# A tibble: 12 × 4
   ID1   ID2       x     y
   <chr> <chr> <int> <int>
 1 A     a         1    21
 2 B     a         2    22
 3 C     a         3    23
 4 D     a         4    24
 5 A     b         5    25
 6 B     b         6    26
 7 C     b         7    27
 8 D     b         8    28
 9 A     c         9    29
10 B     c        10    30
11 C     c        11    31
12 D     c        12    32

Model Summary with Broom

Standard print

m <- lm(mpg ~ wt, mtcars)
summary(m)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Pretty broom/tidy print

library(broom)
tidy(m)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.3      1.88      19.9  8.24e-19
2 wt             -5.34     0.559     -9.56 1.29e-10

Purrr

Replacing for loops with more concise automation

library(purrr)
1:10 |>
    map(\(x) rnorm(10, x))
[[1]]
 [1] 1.2753262 1.2532608 1.5152521 0.9632092 1.1822736 1.9727132 1.2148138
 [8] 1.1607379 0.8689666 0.5555470

[[2]]
 [1] 1.9551079 1.3365568 1.2316640 2.3326519 3.1243529 2.3953717 4.1700423
 [8] 0.5872410 0.2959162 1.7703495

[[3]]
 [1] 2.2391719 2.9379073 5.2970526 3.7504805 4.2398555 0.9076744 3.0377226
 [8] 4.3998811 2.1246664 2.9401894

[[4]]
 [1] 4.627270 1.394210 2.458875 3.188346 3.132968 4.087950 3.478214 3.277804
 [9] 4.090909 3.935028

[[5]]
 [1] 4.628467 4.919631 6.993296 4.567888 4.345368 5.459781 4.528624 5.830616
 [9] 6.160098 3.666932

[[6]]
 [1] 5.999579 5.897450 6.247411 4.400042 4.787289 6.347242 6.353750 5.681297
 [9] 5.705933 6.196801

[[7]]
 [1] 6.798277 6.705364 7.070292 6.843013 7.111320 7.556265 5.455063 5.737431
 [9] 6.483021 7.262715

[[8]]
 [1] 8.020697 7.250144 8.190980 7.393124 9.294098 8.168119 7.507259 7.957210
 [9] 9.648788 7.040450

[[9]]
 [1] 10.052380 10.180445  9.223869  8.939108  8.378289  8.416154  8.018666
 [8]  7.537818  8.970469  9.238761

[[10]]
 [1] 10.122628  9.453996  8.297895  9.534022  9.718270  8.616959  9.780240
 [8]  9.594422  9.949405  9.301528
map_dbl(trees, median)
 Girth Height Volume 
  12.9   76.0   24.2 

Multiple vectors

# generate volume function
volume <- function(diameter, height){
  # convert diameter in inches to raidus in feet
  radius_ft <- (diameter/2)/12
  # calculate volume
  output <- pi * radius_ft^2 * height
  return(output)
}
map2_dbl(trees$Girth, trees$Height, volume)
 [1]  26.30157  26.22030  26.60929  43.29507  50.58013  52.80232  43.55687
 [8]  49.49645  53.76050  51.31268  55.01883  53.87046  53.87046  51.51672
[15]  58.90486  67.16431  77.14819  82.97153  72.68200  66.47610  83.38311
[22]  87.98205  84.85845 100.53096 111.58179 132.22227 136.96744 139.80524
[29] 141.37167 141.37167 201.36365